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Abstract 

iNicholls and Gravl (| 20081 ) describe a phylogenetic model for trait data. The y use their model 
to est imate branching times on Indo-European language trees from lexical data. lAleksevenko et aD 
(|2008h extended the model and give applications in genetics. In this paper we extend the inference 
to handle data missing at random. When trait data are gathered, traits are thinned in a way that 
depends on both the trait and missing-data content. INicholls and Gravl l|2008l ) treat missing records 
as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in th eir cross- 
validation. Our prediction is consistent with the historical record. INicholls and Gravl lj2008T l dropped 
se ven languages with too much missing data. We fit all twenty four languages in the lexical data 
of iRinge et all <|2002h . In order to model spatial-temporal rate heterogeneity we add a catastrophe 
process to the model. When a language passes through a catastrophe, many traits change at the same 
time. We fit the full model in a Bayesian setting, via MCMC. We validate our fit using Bayes factors 
to test known age constraints. We reject three of thirty historically attested constraints. Our main 
result is a unimodel posterior distribution for the age of Proto-indo-European centered at 8400 years 
BP with 95% HPD equal 7100-9800 years BP. 

The Indo-European languages descend from a common ancestor called Proto-Indo-European. Lexical 
data show the patterns of relatedness among Indo-European languages. These data are "cognacy classes": 
a pair of words in the same class descend, through a process of sound change, from a common ancestor. 
For example, English sea and German See are cognate to one another, but not to the French mer. 
Gray and Atkinson! 1 2003h coded data of this kind in a matrix in which rows correspond to languages and 



columns to distinct cognacy classes, and entries are zero or one as the language possessed or lacked a term 
in the column class. They analysed these data using phylogenetic algorithms similar to those used for 
genetic data. Our analysis h as the same objecti ves, but we fit a model designed for lexical trait data. We 
work with data compiled by IRinge et all <|2002h . recording the distribution of some 872 distinct cognacy 



classes in twenty four modern and ancient Indo-European languages. In section we give estimates for 
the unknown topology and b ranching times of the phylogeny of the core vocabulary of these languages. 
Nicholls and Gray] ( 2008h analyse the same data using a closely related stochastic Dollo- model for 



binary trait evolution. However, those authors were unable to deal with missing trait records. Missing data 
arise when we are u nable to answer the question "does language X possess a cognate in cognacy class Y?". 
Nicholls and Gray ( 2008h dropped seven languages which had many missing entries, and treated missing 



trait records in the remainder as absent traits. This is unsatisfactory. However, it is not straightforward t o 
give a model-based integration of missing data for the trait evolution model of Nicholls and Gray 1 20081 ) . 



In this paper we integrate the missing trait data, and this technical advance allows us to fit all twenty 
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four of the l anguages in the original d ata. The binary trait model of Nicholls and Gray ( 2008h . has been 
extended by Aleksevenko et al. 1 2008f ) to mutliple-level traits, and is finding applications in biology. A 
proper treatment of missing data will be of use in other applications. 

We are specifically interested in phylogenetic dating. Because we are working with lexical, and not 
syntactic data, it is the age of the branching of the core vocabulary of Proto-Indo-European that we 
estimate. This is a controversial matter. Workers in historical linguistics have evidence from linguistic 
paleontology that the most recent common ancestor of all kn own Indo-Euro pean languages branched no 



1989( l . For a recent review o f the 



earlier than about 6000-6500 years Before the Present (BP) ijMalloryl . _ 
argument from linguistic paleon tology, and a criticism of phylogenetic dating, see Garrett 1 20061 ) and 
McMahon and McMahonl fe()05l ). An altern ative hypothesi s suggests that the spread began around 8500 



BP when the Anatolians mastered farming ( Renfrew . 19871 ) in the early neolithic. Recent efforts to apply 
qua ntitative phylogenetic methods to dating Proto-Indo -European give a time depth of 8000 to 9500 years 
BP 1 Nicholls and Gray . 20081 : Gray and Atkinson! . 20031 ). supporting the link to farming. In this work we 
obtain a unimodal posterior distribution for the age of Proto-Indo-European centered at 8400 years BP 
with 95% HPD equal 7100-9800 years BP 

One point of view is that the argument from linguistic paleontology is correct, and the phylogenetic 
dates are incorrect, due to rate heterogeneity , or some other model failing. In this respect we advance 
the search started in Nicholls and Gray ( 20081 ) for a model mispecification which might explain the 20% 
difference the central age estimates from our fitting, and the status quo. The difference is large enough 
that we are hopeful of finding a single coherent error, if any such error exists. Accounting now for missing 
data, we are able to publish a much wider cross-validation test (up from 10 to 30 tested calibrations). 
Also, we fit a model for explicit rate heterogeneity in time and space. In view of the spatial-temporal 
homogeneity we find here and in other data, the case for the earlier date seems now fairly strong. The most 
probable alternative seems to be a step change in the rate of lexical diversification acting in a coordinated 
fashion across the Indo-European territory some 3000 to 5000 years ago. 

In Sections [Tl and |2~T1 we describe the data and specify a subjective prior for the phylogeny of vocabu- 
laries. In sections [2.21 and l3j we give a generative model for the data and the corresponding likelihood. We 
include, in Section 02 a recursive algorithm which makes the sum over missing data tractable. In sections 
OH and 03 we give the posterior distribution on tree and parameter space, and briefly describe our MCMC 
sampler. In section 03 we discuss likely model mispecification scenarios, test for robustness by fitting 
synthetic data simulated under such conditions, and cross-validate our predictions. We fit the model to 
a data set of Indo-European languages in sectio n E This paper has a supplement giving an analysis of a 



second data set, collected by Dven et al. 1 19971 ) 



P hylogenetic methods have been u sed to make inference for t ree dRinge et a . , 20021) and network struc- 
ture (McMaho n and McMahonl . 120051 ) in Historical Linguistics. IWarnow et al.1 (|2004 ) write down a more 
realistic model for the diversification of vocabulary, accounting for word-borrowing between languages (so 
that their history need not be tree-like) but there is to date no fitting. 



1 Description of the data 

The data group words from 328 meaning categ ories and 24 I ndo-E uropean languages into 872 homology 
classes. The data were collected and coded by Ringe et al. 1 20021 ). Meaning categories cover the "core" 



vocabulary and are assumed relevant to all languages in the study. Two words of closely similar meaning, 
descended from a common ancestor but subject to variation in phonology, are cognate terms. A cognacy 
class is a homology class of words all belonging to a single meaning category. For example, for the meaning 
"head", the Italian testa and the French tete belong to the same cognacy class, while the English head 
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Old English 


stierfp 


Old High German 


stirbit, touwit 


Avestan 


miriiete 


Old Church Slavonic 


umiretu 


Latin 


moritur 


Oscan 


9 



(a) 



Old English 


1 


Old High German 


1 1 


Avestan 


1 


Old Church Slavonic 


1 


Latin 


1 


Oscan 


? ? ? 



(b) 



Table 1: An example of data coding: (a), the word "he dies" in six Indo-European languages; 
(b), the coding of this data as a binary matrix with ?'s for missing data. In the nota- 
tion of Section QJ the first column in (b) is a cognacy class M^ es € ^dies w ^ n ^dies = 
{{Old English, Old High German}, {Old English, Old High German, Oscan}} 



and the Swedish huvud belong to another cognacy class. An element of a cognacy class is thus a word in a 
particular language, a sound-meaning pair. These elements are called cognates. The vocabulary of a single 
language is represented as a set of distinct cognates. In our analysis a cognate has just two properties: its 
language and its cognacy class. If there are N distinct cognacy classes in data for L languages, then the 
a'th class M a C {1, 2, L} is a list of the indices of languages which possess a cognate in that class. The 
data are coded as a binary matrix D. A row corresponds to a language and a column to a cognacy class, 
so that Di a = 1 if the a'th cognacy class has an instance in the i'th language, and Di a = otherwise. See 
TableQ]for an example. This coding allows a language to have several words for one meaning (such as Old 
High German stirbit and touwit for "he dies", an instance of polymorphism), or no word at all. Missing 
matrix elements arise because the reconstructed vocabularies of some ancient languages are incomplete. 
If we are unable to answer the question "does language i possess a cognate in cognacy class a" then we 
set A, a =?■ 

We need notation for both matrix and set representations with missing data. Denote by B a column a 
of L x N matrix B. For a = 1, 2, N let V a be the set of all column vectors d* allowed by the data D a 
in column a of D, 

V a = {d* e {0, 1} L : A,a G {0, 1} d* a = A,a, * = 1, 2, ...,£} . 

For d* e V a let m(d*) = {i : d* = 1}. Denote by Q a the set of cognacy classes consistent with the data 
D a , so that 

n a = {w C {1, 2, . . . , N} : (J = m(d*), d* e V a }. 

The data D are then equivalently = (£!i,fi2, ■■■,Qn)- The f2 a -notation generalizes the M a -notation to 
ha ndle missing data. It is illustrated in Table [U 

Ringe et al. ( 2002h list twenty four mostly ancient languages. For eleven of these languages (Latin, 



Modern Latvian, Old Norse...), all the matrix entries are recorded. For the rest, the proportion of missing 
entries varies between 1% (for Old Irish) and 91% (for Lycian, an ancient language of Anatolia). Note 
that data are usually missing in small blocks corresponding to the cognacy classes for a given meaning 
category, as in Tab l eCD W e do not model this aspect of the missing data. This is related to the model-error 
Nicholls and Grayl <|2008h call 'the empty-field approximation', under which cognacy classes in the same 



meaning category are assumed to evolve independently. 
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For these Indo-European phylogenies, the topology of some subtrees are known from historical records. 
We have lower and upper bounds for the age of the root node of some subtrees. For example, the Slavic 
languages are known to form a subtree, and the most recent common ancestor of the Slavic languages 
in the data is known to be at least 1300 years old. In this way internal nodes of the phylogeny are 
constrained. We have age bounds for leaf nodes as well, since we are told when the vocabulary of the 
ancient languages in these data was in use. We combine these calibration constraints with the cognacy 
class data in section[2J] Jumping ahead to our results, Figure[5]is a sample from the posterior distribution 
we find for phylogenies. Calibration constraints are represented by the black bars across nodes in this 
tree. 



2 Models 

We specify a subjective prior for phylogenies, representing a state of knowledge of interest to us. The 
model we give for the diversification of vocabulary in Section [2721 is. in contrast, generative. We model 
the diversification of vocabulary as a branching process of sets of cognates, evolving on the phylogeny. 
Leaves are labeled by languages and a branch represents the ancestral lineage of a vocabulary. 

2.1 Prior distribution on trees 



The material in this subsection follows Nicholls and Gray 1 20Qgh . We consider a rooted binary tree 
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with 2L nodes: L leaves, L — 2 internal nodes, a root node r = 2L — 1 and an Adam node A = 2L, 
which is linked to r by a edge of infinite length. Each node i = 1,2, ...,2L is assigned an age i; and 
t = (ti,t2, ■■■j^a); the units of age are years before the present; for the Adam node, t& = +oo. The edge 
between parent node j and child node i is a directed branch of the phylogeny, with the orderings 
i < j and ti < tj. Let E be the set of all edges, including the edge (r, A), let V be the set of all nodes and 
Vi = {1, 2, L} the set of all leaf nodes. Our base tree space F is the set of all rooted directed binary 
trees g = (E,V,t), with distinguishable leaves and, for i = 1,2, ...,2L, node ages i< > assigned so that 
the directed path from a leaf i £ Vl to node A passes through nodes of strictly increasing age. With this 
numbering convention, (E 7 V) is called the ordered history of a rooted directed binary tree. 

We allow for C calibration constraints on tree-topology and selected node ages. These are described 
at the end of Section [TJ Each such constraint c allows trees in just some subspace r' c ' C V of tree 
space. We add to these constraints an upper bound on the root time at some age T > 0, and set 
r<°) = {(E, V, t) £ r : t r < T}. The space of calibrated phylogenies with catastrophes is then 

c=0 

The root age is a sensitive statistic in this inference. A prior distribution on trees, with the property 
that the marginal distribution of the root age is uniform over a fixed prior range < t r < T, is of 
interest. For node i £ V, let tf = sup ffgr c ij and = inf ffgr c ti be the greatest and least admissible 
ages for node i, and let S = {i £ V : tf = T}, so that S is the set of nodes having ages n ot bounded 
above by a calibration (there are 12 such nodes in Figure ©). iNicholls and Grayf l|2008F l show that, 



before calibration (when (7 = 0) and for tree spaces in which the leaves have fixed equal ages, the prior 
probability distribution with density 



f G (g\T)^l[(t r -tr 



ies 
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gives a marginal density for t r which is exactly uniform in < t r < T. Nicholls and Gray ( 2008h 



do not comment on the distribution determined by fa over tree topologies. The prior fc has in this 
case (C = 0) a uniform marginal distribution on ordered histories exactly equal to the corresponding 
distribution for the Yule model. This is not uniform, but favors balanced leaf-labeled topologies. For 
C > 0, and on c atastrophe-free tree space s in which leaf ages are constrained by prior knowledge to lie in 
an interval only, Nicholls and Gray ( 2008h argue from simulation studies that this prior gives a reasonably 



flat marginal distribution for t r if in addition T 3> max ig y\s tf (the greatest upper bound among the 
calibration constraints is not too close to the upper bound on the root age). We give a sample from the 
prior in the Supplementary Material; the prior does not represent any reasonable prior belief before about 
4500 BP, but this is immaterial as the likelihood rules these values out (an instance of an application of 
the principle of sufficient reason) . 

2.2 Diversification of cognacy classes 

In this subsection we extend the stochastic Dollo model of Nicholls and Gray 1 200§t ) to incorporate rate 
heterogeneity in time and space, via a catastrophe process. 

Consider the evolution of cognates down the ancestral lineage of the vocabulary of a single language. 
An example is given in Figure [TJ A new cognacy class is born when its first cognate is born. This new 
word is not cognate with other words in the modeled process. Loan words from outside the core meaning 
categories of any language in the study, or from a language outside the study, may be good for word birth 
without violating this condition. A cognate dies in a particular vocabulary when it ceases to be used for 
the meaning common to its cognacy class: its meaning may change, or it may fall out of use. 

Cognates evolving in a single language (ie down a single branch of a language phylogeny) are born 
independently at rate A, die independently at per capita rate p, and are subject to point-like catastrophes, 
which they encounter at rate p along a branch. At a catastrophe, each cognate dies independently with 
probability k, and a Poisson number of cognates with mean v are born. At a branching event of the 
phylogeny, the set of cognates representing the branching vocabulary is copied into each of the daughter 
languages. See Figure [TJ 

The process we have described is not reversible, and this greatly complicates the analysis. It seems 
acceptable, from a data-modeling perspective, to impose the condition v = kA//i, which is necessary 
and sufficient for reversibility (see Supplementary Material for a proof). Under this condition, adding a 
catastrophe to an edge is equivalent to lengthening that edge by Tc{k,h) = — log(l — years. This 
follows because the number of cognates generated by the anagenic part of the process in an interval of 
length Tc is Poisson distributed with mean £(1 — e~^ Tc ) equal to kA//x, and the probability that a cognate 
entering an interval of length Tc dies during that interval is 1 — e~^ Tc , which equals n. 

Because a catastrophe simply extends its edge by a block of virtual time, the likelihood depends 
only on the number of catastrophes on an edge, and not their location in time. Let ki be the number 
of catastrophes on edge (i, j), and k = {k\, . . . ,/c2L-2) be the catastrophe state vector. We record no 
catastrophes on the (i?, A) edge (its length is already infinite). The tree g = (V, E, t, k) is specified by its 
topology, node ages and catastrophe state. Calibrated tree space extended for catastrophes is 

T c K = {{V,E,t,k) : (y,£,f)er c ,^Nf- 2 }. 

We drop the catastrophe process from the calculation in Section 03 It is straightforward to restore it, and 
we do this in the expression for the posterior distribution in Section HJ 
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Figure 1: Description of the model: births and deaths of traits are marked. The dots correspond to 
catastrophes, at which multiple births and deaths may occur simultaneously. The cognate sets this 
generates at leaves are shown on the right. Calendar time flows from left to right and the age variables U 
in the text increase from right to left. 



2.3 The registration process 

Let D* denote a notional full random binary data matrix, representing the outcome of the diversification 
process of Section [231 The number of columns in D* is random, and equal to N* . For the realization 
depicted in Figure [TJ D* = D* with D* displayed in Figure El Note that D* has a column for each 
cognacy class present at the root node, or born below it, whether or not the cognacy class has any 
cognates represented in any leaf languages. 

We call the random mapping of the unknown full representation D* to the registered data D, which 
is a random matrix with N columns, the registration process. There are two stages to this process: the 
masking of matrix elements of the fully realised data D* with ?'s to form an intermediate data matrix D, 
and the selection of columns from D to determine the realised data D. 

Let I* be a random L x N* indicator matrix of independent Bernoulli random variables for observed 
elements. The zeros of I* mark matrix entries in D* which will be unobservable. For i = 1,2, ...,L and 
given a = 1,2,...,^*, let £; = Pr(I* a = 1). The probability £, that we can answer the question, "does 
language i possess a word in the a'th cognacy class?", is assumed to be a function of the language index i 
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Figure 2: Registration of the vocabulary realized in Figurefflsupposing a masking matrix /*, as above. D* 
is the unobserved full data with a column for each cognacy class and a row for each language of Figure Q] 
(thus cognate 1 is present in rows 1,2,6,7 and 8); zeros in the masking matrix I* indicate missing matrix 
elements. Some cognacy classes are then thinned (in this example, the registration rule keeps cognacy 
classes with instances displayed in one or more language) to give the registered data D. 

only. If we get an answer, it is assumed correct. Let £ = (£i, . . . , £l) and denote by I* a realisation of I*. 
Denote by D = D(D*,I*) the masked version of the full random data matrix: if I* a = 1 then D i:a = D* a 

and if I* a = then Di >a =?. 

Matrix columns may be missing too, so that TV < N*. We get missing columns even when there are 
no missing data. The matrix D in Figure [2] includes some columns with only zeros and question marks, 
corresponding to cognacy classes which existed in the past but are observed in none of the leaf-languages 
from which the data were compiled. These cognacy classes are not included in the registered data. Denote 
by R the registration rule D = R(D) mapping the full data to registered data. 

Rule R may thin additional column types. Let Y and Q be functions of the columns of D* and /* 
counting the visible l's and ?'s respectively, 

L 

Y(B* a X) = Ew,«> 

i=l 
L 

Q(I* a ) = 5>-It ). 

i=l 

Given a = 1,2, ...,7V*, let Y a = F(D*,I*) and Q a = Q(I*). 

In Appendix [A] we give an efficient algorithm for computing the likelihood for rules formed by com- 
pounding the following elementary thinning operations: 

(1) Ri{D) = (D a : Y a > 0) (discard classes with no instances at the leaves); 

(2) i?2(-D) = {D a '■ Y a > 1) (discard classes - singletons - observed at a single leaf); 

(3) i?3(-D) = (D a :Y a <L) (discard classes which are observed at all leaves); 

(4) Ra{D) = {D a :Y a < L — 1) (discard classes which are observed at all leaves or at all leaves but one); 

(5) R§{D) = {D a :Y a + Q a <L) (discard classes which are potentially present at all leaves); 
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(6) R§{D) = (D a :Y a + Q a <L — l) (discard classes which are potentially present at all leaves or at all 
leaves but one). 

We assume the chosen rule includes Condition (1). The rule D = R(D) with 

R(D) = R 4 oR 2 (D) 

collects "parsimony informati ve" cognacy c lasses. iRonquist et al.l 1 20051 ) give the likelihood for the finite- 
sites trait evolution model of iLewisI (|200lh for registration rules like (1-6). In the example in Figure [2j 
and in Sections [6.21 and l7j) we fit data registered with R(D) = Ri(D). 

The selection of columns is something we have in general no control over: the column selection rule 



simply describes what happened at registration. Results in the Supplementary Material for the lDyen et al 
(|l997l ) data use the rule R(D) = R2(D), since singleton columns were not included in that data. However 



certain column types may include data which is hard to model well, and so we may choose to make further 
thinning using the other rules. Recursions for the other rules are given in an Appendix. 

Column indices a = 1, 2, ...,7V* are exchangeable. It is convenient to renumber the columns of D* , I* 
and D after registration, so that D a = D a and I* = I a for a = 1,2, ...,7V. The information needed to 
evaluate Y a and Q a is available in the column D a and set fl a representations. We write Y(D a ) = Y(Q, a ) = 
Y a and Q(D a ) = Q{n a ) = Q a . 



2.4 Point process of births for registered cognacy classes 

Fix a catastrophe-free phylogeny g £ T^, with k = (0, 0, 0), and let an edge and a time r £ [U,tj) 
be given. Denote by [g] the set of all points (r, i) on the phylogeny, including points (t, r) with r > t r in 
the edge (r, A). The locations zd = {z\, Z2, zn} of the birth events of the TV registered cognacy classes 
are a realization of an inhomogeneous Poisson point process Zd on [g] . Let Z £ [g] be the birth location 
of a generic (and possibly unregistered) cognacy class M C {1, 2, L}, corresponding to a column of D 
with Y observed l's and Q ?'s, and let £z be the event that this class generates a column of the registered 
data. For the a'th cognacy class M a , born at Z a , this event is £z a — {D a = R(D a )} since R(D a ) is empty 
for D Q a column dropped at registration. 

Cognacy classes are born at constant rate on the branches of g, but are thinned by registration. 
However, conditional on the birth location Z a = z a , our modeling assumes that the outcome £z a for the 
a'th cognacy class is decided independently of all events in all other cognacy classes. The point process 
Zd of birth locations of registered cognacy classes has intensity 

\(z) = XPv(£ z \g, t i,X,^Z = z) 

at z £ [g] and probability density 

1 N 
fz D (zD) = j^e~ A ^l[\(z a ) 

a=l 

with respect to the element of volume dzo = dz\dz2---dzjq in [g] , where 

A([g}) = I \(z)dz 
J[g] 

= Yl M(r,i))dr. 

(i,j)£E Jt ' 

The number N of registered cognacy classes is N ~ Poisson(A([g])). 
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3 Likelihood calculations 



We give the likelihood for g, fx, A, k, p and £ given the data, along with an efficient algorithm to compute 
the sum over all missing data. The catastrophe process is left out, and reincorporated in the next section. 

Since we only ever see registered data, the likelihood for g, A, p, and £ is the probability P[D = 
D\g : fi, A,£,D = i?(D)], to get data D given the parameters and conditional on the data having passed 
registration. We restore the birth locations (and so omit A from the conditioning), and factorize using 
the joint independence of D a , a = 1, 2, TV under the given conditions: 

P[D = P| ff ,/i,A,£,D = P(D)] = j f ZD (z D )P[T> = D\g,ti,Z,Z D = z D ,-D = R(D)}dz D , 

e -A([sD JL r 

[X P[£ Za \g,fj,,^,Z a = z a ]P[T) a =D a \g,fi,^,Z a =z a ,£z a ]dZa, 
a=l J \a] 

N 

TA / P\p a = D a \g,n,Z,Z a = z a \dz a . 
, Ji„] 



a=l 



-Mis]) 



a=l 



The last line follows because P[£zJD a = D a ,...,Z a = z a ] = 1 for D a a column of registered data: if 
the outcome of the birth at z a was the registerable data D a then the event £ Za certainly occurs. The 
likelihood depends on the awkward condition D = P(D) through the mean number \{[g\) of registered 
cognacy classes only, while P[D a = D a \g, /Lt, £, Z a = z a ] is the probability to realise the data vector 
D a in the uncond i tioned diversification/missing element process. The calculation has so far extended 
Nicho lls and Gravl ()2008h to give the likelihood for a greater variety of column thinning rules. We now 



add the missing element component of the registration process. 

We sum over possible values of the missing matrix elements in the registered data. Since P[D a = 
D a \g, /x, A, £, Z a = z a ] is not conditioned on the requirement that the column D a gets registered, the 
entries of the corresponding column I a are determined by the unconditioned Bernoulli process, and we 
have 



P[D Q = D a \g,fJ.,£„Za = Za] = ^ P[I* , D* = d* \g, fj,, £, Z a = z a ] 

d*ev a 

L 

= n£ i "( 1 -&) 1 ~ J< " E PlV* a =d*\g,n,Z,Z a 



i=l d*£V a 

The likelihood is 

-A([ 3 ]) N L . 

P[D = D\g, (A, A, f, D = R(D)} = — — [] ([] (l-^-^X / £ P[M = u\g, ft,Z,Z = z a ]dz a , 

o=i i=i J \s) ujen a 

(1) 

where we have switched now from summing d* € V a to the equivalent set representation ui G fl a - 

For the two integrated q uantities in Equati on (Q]) we have tractable recursive formulae. We are using 
a pruning procedure akin to Felsensteml 1 1981 ). We begin with A([<?]). 



We assume the registration rule includes at least Condition (1). It follows that a cognacy class born 
at Z = (r, i) in [g] must survive down to the node below, at Z = (ti, i), in order to be registered, and so 

P\£ Z \Z = (r, i), g, M , £] - P\£ Z \Z = (t it i), g, M , ^e^^-*'). 
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We can substitute this into the expression for A( [<?]), and integrate, to get 

A M> = ~ £ P[£ z \Z={t h il9,^]^-e-^- u) ). (2) 

Given a node i, let be the set of leaf nodes descended from i, including i itself if i is a leaf. Let 
Si = card(^ l) ). Denote by u| n) = P[Y = n\Z = (t h i), g, fx, f] and = P[Y + Q = n\Z = (U, i), g, n, £]. 
We can compute A for rules made up of combinations of Condition (1) with any combination of Conditions 
(2-6), from uf\ uf~ , u'f 1 1 \ u[ s, \ ^ and v[ s *\ For example, 

(l-uf R = Ri, 

P[ez\Z = (U,i),g,H,t] = l R = R 2 , (3) 

1 „(°) „W „ (L " 1) R-R nR 1 

^ 1 — U\ — u i — u i — u i H = /I3 O tl2 ■ ) 

Notice that = unless si > n, so for example u\ L ' is non-zero at i the root node only. Since our 
main application is for data registered under Condition (1), we give, in the body of this paper, recursions 
for u!- and only. See Appendix [A] for the recursions needed to evaluate the likelihood under rules 
involving Conditions (2-6). 

For nodes i and j, let Sij = e -/1 w -ii ). Consider a pair of edges (ci,«), (c 2 , i) in E. 



(o) 
u) 



= ((1 - <W + 5 itC1 u<g>) ((1 - S hC2 ) + tf <iCa u£>) (4) 



W C°) = I A „(°) 



- vi - Si,*) n + a - n & ] ( 5 ) 

ieY? j V iev; oa 



The recursion is evaluated from the leaves i 6 Vl, at which 



CO) 
,(°) 



i - 6 (6) 



v?> = (7) 

We now give the equivalent recursions for A Jj XLefi a ■f > [^ = ^a\Z = z a ,g,/j] dz a . Consider the set 
Tn a = PL en w of leaves known to have a cognate in the ath registered cognacy class. Let E a be the set 
of branches on the path from the most recent common ancestor of the leaves in m a up to the Adam-node 
A above the root. Cognacy class a must have been born on an edge in E a . For a = 1, 2, N, class M a 
is non-empty and we can shift the birth location to the node below and convert the integral to a sum, 



A 



/ £ P[M = w\Z = Za,g,i4dza = - J2 P[M = u\Z= (M),<7,M](1-<M- 

J\fi\ , A* /. .v^c , 



For each a = 1,2, iV and w e Q a , let = wn v£ l) and 

fi« = : = w n w G fi a } 

denote the set of all subsets of the leaves which are cognacy classes consistent with the data 
available for those leaves. Consider two child branches (ci,i) and (c2, i) at node i. Since 
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and events are independent along the two branches, 

P[M = Lj\Z=(U,i),g,(i,£\ = J2 PW = ^ ] \Z = [U,cx),g,n] 
Luen a w( c i)enl ci) 

x P[M = w^\Z = (fi,ca),g,ft]. 

Having moved the birth event at (U,i) to (t,,ci) and (U,C2) (off the node and onto its child edges) we 
now move the birth event at (£s,c) to (t c , c) (down an edge) as follows: 

]T P[M = cj\Z=(t i ,c),g, f 4 = 
) 

' S itC x ]T P[M = u\Z=(t c ,c),g,ii] ifF(^i c) )>l 

(1 - * <iC ) + 6 i>c X p l M = "\ z =( t c,c),g,t4 iiY(n i a C) )^OandQ(n i a c) )>l 



(0) 



(1 - 6i,c) + ^t, C V 



The recursion is evaluated from the leaves. If c is a leaf, then 



if y(tti c) ) + Q(f>i c) ) = o 

(i.e. fti c) = {0}) 



^ l0 iffii c) = {0} (i.e. £> c ,„ = 0). 

In order to restore catastrophes to this calculation, and given g € IV, with fc» catastrophes on edge 
(i,j) G J5, replace tj — ti with tj — U + kiTc{n, p) throughout. 



4 Posterior distribution 

Our prior on the birth rate A, death rate p and catastrophe rate p is p(X,p,p) oc and we take a 
uniform prior over [0, 1] for the death probability at a catastrophe k and each missing data parameter 

Substituting using equations ©-([I} into equation (Q]) and multiplying by the prior fa(g\T)p(X, p, p), 
we obtain the posterior distribution 

p(g,(j,,\,K,p,£\D = D) 



1 ( A 



N 



N\ \pj I /i 



exp ( -- J2 P[£z\Z=(t i ,i),g, l x,K,m-e-^- ti+kiTa) ) 
j)eE 



N 



X II E E P [ M = ^=(*- i )^ ) M](l-e-^- ti+ferc) ) 



1 . , ,^e-'l»l( p | 5 |)*- 



x—f G (jg\T) 
pAp kt' 



(8) 
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for parameters p, A, K, p > 0, < & < 1 and trees <? 6 T^- . 

The posterior is improper without bounds on p since kr — is allowed. We place very conservative 
bounds on p. Results are not sensitive to this choice. We can show that, for 'typical' data sets D, and 
in particular the data analysed below, the posterior is then proper. Details of the relationship between 
cognate classes and calibration constraints play a role in the conditions for the posterior distribution to 
be proper. 



5 Markov Chain Monte Carlo 

We use Markov Chain Monte Carlo to sample the posterior distribution and estimate summary statistics. 
If the prior on the cognacy class birth rate parameter A has the conjugate form A" cxp(— vX) then the 
conditional distribution of A in the posterior distribution above has the form X N+U exp(— X(X + v)). 
We took the improper prior u = — 1, v = for A and integrated. The MCMC state is then x = 
((E, V, t, k), p, K,p,£) and the target distribution p(x\D) is the density obtained by integrating the density 
in Equation [8] over A. 

The MCMC sampler de scribed inlNicholls and Gravl l|2008t ) has state x=((E, V, t),p). We add to the 



(E,V,t)- and /^-updates of iNicholls and Gray l|2008f ) further MCMC updates acting on the catastrophe 



vector k = (ki, kL-2), on the catastrophe parameters K,p and on £, the probability parameter for 
observable data-matrix elements. The catastrophe rate parameter p is added to time-scaling updates in 
which subsets of parameters are simultaneously scaled by a common random factor s: if 9 is a parameter 
in the scaled subset having units [time] 11 , then 6 — > s u 0. The probability < < 1 for an element of 
the registered data matrix to be observable is, for many leaf-languages, close to one, so we update those 
parameters by scaling 1 — 

We incorporate updates adding and deleting catastrophes (the filled dots marked on the branches of 
Figure [J) plus an update which moves a catastrophe from an edge to a parent, child or sibling edge. For 
the addition and deletion of catastrophes, we do not need to use reversible jump Markov Chain Monte 
Carlo, as the state vector specifies the numbers, and not the locations, of catastrophes on edges. 

We omit the details of these moves but give, as an example, the update that moves a catastrophe 
from an edge to a parent, child or sibling edge. Let kr = J2i=i the total number of catastrophes. 

Given a state x = (g, p, K, p,£) with g = (V,E,t,k), we pick edge e E with probability h/kx- Let 
be the set of edges neighbouring edge (child, sibling and parent edges, but excluding the edge 
(R,A)) and let qi = card(-E?/jj)). We have in general qi = 4. However, for i the index of a leaf node, 
qi = 2 (1 parent, 1 sibling, no children). If j is the root and i is non-leaf, then qi = 3 (1 sibling, 2 children) 
and if j is the root and i is a leaf we have qi = 1 (a sibling edge). Choose a neighbouring edge 
uniformly at random from Euj) and move one catastrophe from to The candidate state is 

x' = ((V, E 7 1, k'), p, k, p), with k[ = ki — 1 and k'~. = kj + 1 and k'j = kj for j ^ i, i. This move is accepted 
with probability 

/ qi kip{x'\D) 
a{x \x) = mm 1, — , , , 

We assessed converge nce with the a symptotic behaviour of the autocorrelation for the parameters p, k, 



and 4r, as suggested bv lGeverll|l992h . This method indicated that we could use runs of about 10 million 



P' 

samples; we also let the MCMC run for 100 million samples and checked that the computed statistics did 
not vary. 
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6 Validation 



We made a number of tests using synthetic data. Fitting the model to synthetic data simulated according 
to the likelihood P(D = D\g,[i, X,p, K, £, D = R(D)), (in-model data), shows us just how informative 
the data is for catastrophe placement, as well as making a debug-check on our implementation. We fit 
out-of-model data also. These are synthetic data simulated under likely model-violation scenarios, and are 
used to identify sources of systematic bias. We summarise results for synthetic data simulated using the 
parameter values we estimate in section[7]on the data. For in-model data we correctly reconstruct topology, 
root age and the number and position of catastrophes. Further details are given in the Supplementary 
Material. 



6.1 Model mis- specification 

Nicholls and Gray use out-of-model data representing un- modeled loan-words (called borrowing), 



rate-heterogeneity in time and space, rate heterogeneity across cognacy classes, and the empty-field ap- 
proximation. They discuss also model mis-specification due to missing data and the incorrect identification 
of cognacy classes (in particular, a hazard for deeply rooted classes to be split). Rate-heterogeneity in 
time and space, and missing data are now part of the in-model an alysis. 

We synthesized out-of-model data with the borrowing model of Nicholls and Gray ( 2008h . in order to 
see if un- modeled borrowing biased our results. For what Nicholls and Grayl l|2008t ) call low to moderate 
levels of borrowing, we were able to reconstruct true parameter values well. See the Supplementary 
Material for details. With high levels of borrowing, we under-estimate the root age and over-estimate rate 
parameters. However, unidentified loan words in the registered data do not generate model mispecification 
unless they are copies of cognates which occur in the observed meaning categories of languages ancestral 
to the leaf-languages. It is no t plausible that un i dentifi ed borrowing of this kind is present at high levels. 

We have not repeated the Nicholls and Gray ( 2008^ 1 out-of-model analysis of rate heterogeneity across 
cognacy classes. 

In a real vocabulary, distinct cognacy classes that share a meaning category would not evolve indepen- 
dently. Also, a real language might be expected to a possess a word in each of the core meaning categories. 
This constrains the numbe r of cognacy classes i n eac h meaning category to be non-zero. Our model allows 
empty meaning categories. Nicholls and Gray ( 20081 ) find no substantial bias in a fit to out-of-model data 
respecting the constraint. 

Our treatment of missing data introduces a new mis-specification. We modeled matrix elements as 
missing independently. However, we get missing data when we do not know the word used in a given 
language to cover a given meaning category. Matrix elements are as a consequence typically missing 
in blocks corresponding to all the cognacy classes for the given meaning. Because the ages of poorly 
reconstructed languages are well predicted in the cross-validation study below, we have not looked further 
at this issue. 



6.2 Prediction tests for calibration 

In the next section we estimate the age of a tree node (the root). In this sectio nwe test to see if the 
uncertainties we estimate are reliable. 

The calibration data described in Section {TJ fix the topologies and root ages of the subtrees marked 
with bars in Figure [5J. We remove each calibration constraint in turn and use the data and the remaining 
constraints to estimate the age of the constrained node, and the probability for the constrained subtree 
topology. The topological constraints were all perfectly reconstructed. In twenty three of the twenty eight 
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tests the constrained age interval overlaps the 95% HPD interval, as shown in the bottom half of Figure 

El 

How good or bad are each of these predictions? Looking at the Hittite prediction in Figure El the large 
prediction uncertainty only just allows the calibration interval. Is this bad? We quantify the goodness- 
of-flt, for each calibration, using Bayes factors to replace p-values as indices of misfit. For each constraint 
c = 1,2, C we compute a Bayes factor measuring the support for the fully constrained model compared 
to a model with just the c'th constraint removed. 

For c= 1,2,...,C let 

c 

y°- c = p| r (c,) . 

denote the enlarged tree space with the c'th constraint removed, and let r^~ c be the enlarged tree space, 
extended to include catastrophes (as in Section fl2.2|l ). For each constraint c=l,2,...,Cwe make a model 
comparison between a common null model with all the constraints, H n : g 6 T^, and an alternative model 
H\ '■ g £ r^,~ c with the constraint removed. The Bayes factor Bc,c-c for the model comparison is the 
ratio of the posterior probabilities for these models with model prior P(Hq) = P(Hi), 



B. 



P(D\g e r£) 



P(D\g € r£" c ) 

p(i% ergnr£- c ) 

P(D\g G T C K ~ C ) 

P{g^T c K \D,g^T c K - c ) 
P(ger c K \ g er%- c ) ' 

where the second line follows since C r^~ c and the third from the definition of the conditional 
probabilities. The numerator P(g £ r^|Z?,g G r^~ c ) is the posterior probability for the c'th constraint 
to be satisfied given the data and the other constraints. The denominator, P(g G T^-\g G T^-), is the 
prior probability for the c'th constraint to be satisfied given the other constraints. We estimate these 
probabilities using simulation of the posterior and prior distributions with constraint c removed. The 
Bayes factors are estimated with negligible uncertainty and 2 log(Bc,C-c) is plotted for c = 1,2, ...,C in 
the top half of Figure El 

Strong evidence against a calibration is failure at prediction. Taking a Bayes factor ex ceeding 1 2 (tha t 
is, 2log(Bc,c-c) ^ 5 in Figure EJ as strong evidence against the constraint, following Rafterv 1 19961 ). 



we have conflict for three of the thirty constraints: the ages of Old Irish and Avestan, and for the age 
of the Balto-Slav clade. As our analysis in Section [7] shows, there is a high posterior probability that a 
catastrophe event occurred on the branch between Old Irish and Welsh, and another between Old Persian 
and Avestan. The evidence for rate heterogeneity in rest of the tree is so slight, that when we try to 
predict these calibrations we are predicting atypical events. 

Our missing-data analysis has helped here. The calibration interval for the Hittite vocabulary in 
these data is 3200-3700BP. A reconstruction of the age of Hittite which ignores missing data predicts 
60-2010BP, well outside of the constraints. The 95% HPD interval for the age of Hittite in our model is 
430-3250BP, which just overlaps the constraint. The Bayes factor gives odds less than 3:1 against, so the 
evidence against the constraint is 'hardly worth mentioning'. 
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HI TA TB LU LY 01 UM OS LA GK AR GO ON OE OG OS PR AV PE VE CE IT GE WG NW BS BA IR II TG 




Figure 3: Reconstruction of known node ages: top, logarithm of Bayes factors \og(Bc,c-c) for c = 
1, 2, C ; bottom, thin lines show age constraints for different nodes, thick lines show 95% posterior HPD 
interval for the reconstructed dates. The nodes are displayed in the same order as the constraint in Fig. [5| 
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- oscan 

- umbrian 



- lithuanian 

- latvian 



- oldcslavonic 



- luvian 

- hittite 



- tocharianb 

- tocharian a 



Figure 4: Consensus tree for the Rinae et al. \200& ) dataset. Red dots show catastrophes supported with 
probability above one half. 

7 Results 

In this section we present results for our MCMC simulation of the full posterior, Equation An upper 
limit T = 16000 was used in the tree prior of Section |2~T1 Any value for T exceeding around T = 10000 
would lead to the same results. We show a consensus tree in figure [H In a tree, an edge corresponds to 
a split partitioning the leaves into two sets. A consensus tree displays just those splits present in at least 
50% of the posterior sample. Splits which receive less than 95% support are labelled. Where no split is 
present in 50% of the posterior sample, the consensus tree is multifurcating. The date shown for a node 
is the average posterior date given the existence of the split; similarly, the number of catastrophes shown 
on an edge is the average posterior number of catastrophes on that edge given the existence of the split, 
rounded to the nearest integer. Our estimates for the parameters are as follows: n = 1.86-10- 4 ±1.47-10- 5 
deaths/year; K = 0.361 ± 0.055; p = 6.4 • 10 -5 ± 3.3 • 10 -5 catastrophes/year (corresponding to large but 
rare catastrophes: about 1 catastrophe every 15,000 years, or an average of 3.4 on the tree, with each 
catastrophe corresponding to 2400 years of change). 

We display in Figure[5]the calibration constraints on a tree sampled from the posterior. The constraints 
cannot be shown on a consensus tree, as slices across a consensus tree are not isochronous. 

The analysis reconstructs some well-known features of the Indo-European tree. The Germanic, Celtic 



1C 




Figure 5: A typical sample from the posterior. All the constraints on the node ages are shown, except 
those for the Italic, Indo-Iranian and Iranian groups, for which we do not have an upper bound, and the 
root, which has an upper bound at T — 16000 years BP. 
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and Italic families are grouped together, but no particular configuration of their relative positions is 
favored. The Indo-Persian group can fall outside the Balto-Slav group but the relative position of these 
two is uncertain. The deep topology of the tree is left quite uncertain by these data, especially the position 
of Albanian. We find evidence for catastrophic rate heterogeneity in three positions: on the edges leading 
to Old Irish, Old Persian, and in the Umbrian-Oscan clade. 

Our estimate for the root age of the Indo-European family is 8430 ± 1320 years BP. The distribution 
of this key statistic is close to normal. 



8 Conclusions 

Our results give a root age for the most recent common ancestor of the Indo-European family of language 
vocabularies in agreement with earlier phylogenetic studies. Our results are in agreement with models 
which put this date around 8500 BP, and in conflict with models which require it to be less than 6500 
years BP. Our studies of synthetic out-of-model data, and reconstruction tests for known historical data 
support our view that this main result is robust to model error. It would not be robust to a step change 
in the rate of lexical diversification acting in a coordinated fashion across the Indo-European languages 
extant some 3000 to 5000 years ago. 

The methods outlined here for handling missing data and rate heterogeneity in the diversification of 
languages, as seen through lexical data, will find applications to generic trait data. 



A Recursions for other registration processes 

This section complements Section [3j we give iterations for u\ , u[ X \ ti^* , u^^, w- s,_1 ^ and v[ St \ 
These are the quantities needed (as in Equation [3j to evaluate the sum in Equation [2j for registration 
rules which use Condition (1) in combination with other conditions from Section l2~3l Consider a pair of 
edges (ci,i), (c2,i) in E. In the notation of the text, 

^ (0) = ((1 - S itCl ) + «J iiCl <>) ((1 - 5 itC2 ) + 5 itC2 uW) 



ii 



(si_1) - ^W^ 1 1) +Ik 1 =i}(1-<5,,c 1 ))4 C2 ^ C2) 

u iiC1 «(o)+(i-^ cl ) n Uv^+ii-s^) n & 

v ^i) - I x. „,( s <=i) _l n _ x. \ TT n _ c.\ I I x. „,( s <=2) 



kcAT^ + (i - Si, ci ) n o- - &) kcAi a2> + (i - <w n ( j - &) 

(s^vlT 1 ^ + (i - Si, cl ) Sci (i - e) 8 ^-^) (s hC2 vit C2) + (1 - <W(1 - ) 

+ (\ Cl ^ cl) + (1 - S i>Cl )(l - Sci ) (s. ltC2 vt^ 1] + (1 - 6i, C2 )s C2 (l - SC2 'H) 
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The recursion is evaluated from the leaves i £Vl, at which 

(0) («<-i) i c 

(1) c 
(0) 0.-1) n 
(**) 1 
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